Abstract: Existing upper and lower bounds between entropy and error
probability are mostly derived from inequalities, without an explicit
link to joint distributions. From both theoretical and application
viewpoints, a complete set of interpretations of the bounds in relation
to joint distributions is needed. In this work we therefore propose a
new approach that derives the bounds between entropy and error directly
from a joint distribution. Binary classification is taken as the case
study, which illustrates the need for the proposed approach. Two basic
types of classification errors are investigated, namely the Bayesian
and non-Bayesian errors. For both, we derive closed-form expressions of
the upper and lower bounds in relation to joint distributions. The
solutions show that Fano's lower bound is an exact bound for any type
of error in a diagram of "Error Probability vs. Conditional Entropy".
A new upper bound for the Bayesian error is derived with respect to the
minimum prior probability; it is generally tighter than Kovalevskij's
upper bound.

Index Terms: Entropy, error probability, Bayesian errors,
analytical, upper bound, lower bound



I. Introduction 

In information theory, the relations between entropy and error
probability are among the important fundamentals. One milestone among
the related studies is Fano's inequality (also known as Fano's lower
bound on the error probability of decoders), which was originally
proposed by Fano in 1952 but formally published in 1961 [1]. It is well
known that Fano's inequality plays a critical role in deriving other
theorems and criteria in information theory [2], [3], [4]. However,
there is no wide agreement within the research community on exactly who
first developed the upper bound on the error probability [5]. According
to [6], [7], Kovalevskij [8] was recognized as the first to derive the
upper bound of the error probability in relation to entropy, in 1965.
Later, several researchers, such as Chu and Chueh in 1966 [9], Tebbe
and Dwyer III in 1968 [10], and Hellman and Raviv in 1970 [11],
independently developed upper bounds.

The upper and lower bounds of error probability have been a
long-standing topic in studies on information theory [12]-[21].
However, we consider
two issues that have received less attention in these studies:

I. What are the closed-form relations between each bound and joint
distributions in a diagram of entropy and error probability?

II. What are the lower and upper bounds in terms of the non-Bayesian
errors if a non-Bayesian rule is applied in the information
processing?



The first issue implies a need for a complete set of interpretations
of the bounds in relation to joint distributions, so that both the
error probability and its error components are available for
interpretation. The reasons for this need are discussed in later
sections of this paper. Up to now, most existing studies have derived
the bounds from inequalities without using joint-distribution
information, so their bounds are not described by a generic relation
to joint distributions. A significant recent study by Ho and Verdú
[21], using a truncated-distribution approach, established the
relations for general cases of variables with finite alphabets and
countably infinite alphabets. Regarding the second issue, to the best
of our knowledge, no study in the open literature addresses the bounds
in terms of the non-Bayesian errors. We define the Bayesian and
non-Bayesian errors in Section III. The non-Bayesian errors are also
important because most classifications are realized within this
category.

These issues form the motivation behind this work. We take binary
classification as the problem background because it is common and
easily understood from daily-life experience. Moreover, we restrict
the case study to a binary state and Shannon entropy definitions, in
the expectation that simple examples highlight the central principle
of the approach. The novel contributions of the present work are
threefold:

I. A new approach is proposed for deriving bounds directly through an
optimization process based on a joint distribution, which differs
significantly from all existing approaches. One advantage of the
approach is that it can yield closed-form expressions for the bounds.

II. A new upper bound for the Bayesian errors in a diagram of "Error
Probability vs. Conditional Entropy" is derived with a closed-form
expression in the binary state, which has not been reported before.
The new bound is generally tighter than Kovalevskij's upper bound.

III. A comparison study of the bounds in terms of the Bayesian and
non-Bayesian errors is made in the binary state. The connections
between the bounds for the two types of errors are explored for the
first time.

Regarding the first contribution, we also conduct the actual
derivation using a symbolic software tool, which provides a standard
and comprehensive solution within the approach. The rest of this paper
is organized as follows. Section II presents related works on the
bounds. Section III gives several definitions related to binary
classifications. The bounds for the Bayesian and non-Bayesian errors
are given and discussed in Sections IV and V, respectively.
Interpretations of some key points are presented in Section VI.
Finally, Section VII concludes the work and presents some discussions.
The source code used with the symbolic software for the derivation is
included in Appendices A and B.

II. Related Works 

Two important bounds are introduced first; they form the baselines
for comparison with the new bounds. Both were derived from inequality
conditions [1], [8]. Suppose the random variables X and Y represent
input and output messages (out of m possible messages), and the
conditional entropy $H(X|Y)$ represents the average amount of
information lost on X when Y is given. Fano's lower bound [1] is given
in the form:

$$H(X|Y) \le H(P_e) + P_e \log_2(m-1), \qquad (1)$$

where $P_e$ is the error probability (sometimes also called the error
rate, or error for short), and $H(P_e)$ is the binary entropy function
defined by [22]:

$$H(P_e) = -P_e \log_2 P_e - (1-P_e)\log_2(1-P_e). \qquad (2)$$

The base of the logarithm is 2, so the units are bits.

The upper bound is given by Kovalevskij [8] in a piecewise linear
form [10]:

$$H(X|Y) \ge \log_2 k + k(k+1)\left(\log_2\frac{k+1}{k}\right)\left(P_e - \frac{k-1}{k}\right), \quad k < m,\ m \ge 2, \qquad (3)$$

where k is a positive integer smaller than m. For a binary
classification (m = 2), the Fano-Kovalevskij bounds become:

$$H^{-1}(H(X|Y)) \le P_e \le \frac{H(X|Y)}{2}, \qquad (4)$$

where $H^{-1}(\cdot)$ is the inverse of $H(P_e)$. Feder and Merhav
[23] depicted the bounds of eq. (4) and presented interpretations of
two specific points from the background of data compression problems.
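
The binary bounds in eq. (4) can be evaluated numerically. The following is a minimal sketch, not part of the original derivation: the function names are illustrative, and the inverse of the binary entropy is taken on the interval [0, 0.5] and found by bisection, which is an assumption about how the inverse is computed.

```python
import math

def binary_entropy(p):
    """H(p) in bits, with the convention 0 * log2(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def inverse_binary_entropy(h, tol=1e-12):
    """Return p in [0, 0.5] with H(p) = h, found by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fano_kovalevskij_bounds(h):
    """Lower and upper bounds on P_e from eq. (4) for H(X|Y) = h in [0, 1]."""
    return inverse_binary_entropy(h), h / 2.0

lower, upper = fano_kovalevskij_bounds(0.8)
print(f"{lower:.4f} <= P_e <= {upper:.4f}")
```

For a conditional entropy of 0.8 bits, the two values bracket the error probabilities allowed by eq. (4).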

Bounds between error probability and entropy have been studied from
different perspectives. The first difference lies in the entropy
definitions, such as Shannon entropy in [14], [24], [25] and Rényi
entropy in [13], [16], [17]. The second difference is the selection of
bound relations, such as "$P_e$ vs. $H(X|Y)$", "$H(X|Y)$ vs. $P_e$",
"$P_e$ vs. $MI(X,Y)$", and, in [24], "$NMI(X,Y)$ vs. A", where A is
the accuracy rate, and $MI(X,Y)$ and $NMI(X,Y)$ are the mutual
information and normalized mutual information between variables X and
Y, respectively. Another important line of study concerns the
tightness of bounds. Several investigations [17], [18], [20], [21]
have reported improvements in bound tightness. Recently, a study in
[25] suggested adding an upper bound from the Bayesian errors, which
is generally neglected in bound analysis.

Fig. 1. Schematic diagram of the pattern recognition systems (modified
from Figure 1.7 in [29]).

III. Binary Classifications and Related Definitions

Classification can be viewed as one component of pattern recognition
systems [29]. Fig. 1 shows a schematic diagram of such systems. The
first unit is termed representation in the present problem background,
but is called the encoder in a communication background. This unit
performs feature selection or feature extraction. The second unit is
called classification, or the classifier, in applications. Three sets
of variables are involved, namely the target variable T, the feature
variables X, and the prediction variable Y. While T and Y are
univariate discrete random variables representing the labels of the
samples, X can be a high-dimensional random variable that is discrete,
continuous, or a combination of both.

In this work, binary classifications are considered as a case study
because they are more fundamental in applications. Sometimes,
multiclass classifications are processed by binary classifiers [28].
This section presents several definitions needed for the case study.
Let x be a random sample to be classified, satisfying
$x \in X \subset \mathbb{R}^d$, a d-dimensional feature space. The true
(or target) state t of x is within the finite set of two classes,
$t \in T = \{t_1, t_2\}$, and the prediction (or output) state
$y = f(x)$ is within the two classes, $y \in Y = \{y_1, y_2\}$, where
f is a classification function. Let $p(t_i)$ be the prior probability
of class $t_i$ and $p(x|t_i)$ be the conditional probability density
function (or conditional probability) of x given that it belongs to
class $t_i$.

Definition 1: (Bayesian error in binary classification) In a binary
classification, the Bayesian error, denoted by $P_e$, is defined by
[29]:

$$P_e = \int_{R_2} p(x|t_1)\,p(t_1)\,dx + \int_{R_1} p(x|t_2)\,p(t_2)\,dx, \qquad (5)$$

where $R_i$ is the decision region for class $t_i$. The two regions
are determined by the Bayesian rule:

$$\text{Decide } R_1 \text{ if } \frac{p(x|t_1)\,p(t_1)}{p(x|t_2)\,p(t_2)} \ge 1, \qquad \text{Decide } R_2 \text{ if } \frac{p(x|t_1)\,p(t_1)}{p(x|t_2)\,p(t_2)} < 1. \qquad (6)$$

In statistical classification, the Bayesian error is the theoretically
lowest probability of error [29].

Fig. 2. Bayesian decision boundary for equal priors $p(t_i)$ in a
binary classification (modified from Figure 2.17 in [29]).

Fig. 3. Graphic diagram of the probability transformation between
variables T and Y in a binary classification.



Definition 2: (Non-Bayesian error) The non-Bayesian error, denoted by
$P_E$, is defined to be any error that is larger than the Bayesian
error, that is:

$$P_E > P_e, \qquad (7)$$

for the given information of $p(t_i)$ and $p(x|t_i)$.

Remark 1: Based on the definitions above, for a given joint
distribution the Bayesian error is unique, whereas multiple
non-Bayesian errors exist. Fig. 2 shows the Bayesian decision
boundary, $x_b$, on a univariate feature variable x for equal priors.
The Bayesian error is $P_e = e_1 + e_2$. Any decision boundary other
than $x_b$ generates a non-Bayesian error, $P_E > P_e$.
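
Definitions 1 and 2 can be illustrated with a hypothetical univariate example. The sketch below assumes two Gaussian class-conditional densities with equal variance (means -1 and 1, unit standard deviation); these densities, and all function names, are assumptions made for illustration and do not come from the original text. It computes the boundary $x_b$ implied by the rule in eq. (6), the Bayesian error, and the larger error produced by a shifted boundary.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def error_components(boundary, p1, mu1=-1.0, mu2=1.0, sigma=1.0):
    """Return (e1, e2) when x <= boundary is assigned to class 1 (assumes mu1 < mu2)."""
    p2 = 1.0 - p1
    e1 = p1 * (1.0 - normal_cdf((boundary - mu1) / sigma))  # class 1 assigned to class 2
    e2 = p2 * normal_cdf((boundary - mu2) / sigma)          # class 2 assigned to class 1
    return e1, e2

def bayes_boundary(p1, mu1=-1.0, mu2=1.0, sigma=1.0):
    """Boundary x_b where p(x|t1) p(t1) = p(x|t2) p(t2) for equal-variance Gaussians."""
    p2 = 1.0 - p1
    return 0.5 * (mu1 + mu2) + sigma ** 2 * math.log(p1 / p2) / (mu2 - mu1)

p1 = 0.5
x_b = bayes_boundary(p1)
bayes_error = sum(error_components(x_b, p1))            # P_e
shifted_error = sum(error_components(x_b + 0.5, p1))    # a non-Bayesian error P_E
print(f"x_b = {x_b:.3f}, P_e = {bayes_error:.4f}, P_E = {shifted_error:.4f}")
```

Under these assumed densities, moving the boundary away from $x_b$ increases the total error, in line with eq. (7).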

In a binary classification, the joint distribution,
$p(t,y) = p(t = t_i, y = y_j) = p_{ij}$, is given in the general form:

$$p_{11} = p_1 - e_1, \quad p_{12} = e_1, \quad p_{21} = e_2, \quad p_{22} = p_2 - e_2, \qquad (8)$$

where $p_1 = p(t_1)$ and $p_2 = p(t_2)$ are the prior probabilities of
Class 1 and Class 2, respectively, and their associated errors (also
called error components) are denoted by $e_1$ and $e_2$. Fig. 3 shows
a graphic diagram of the probability transformation between the target
variable T and the prediction variable Y via their joint distribution
p(t, y) in a binary classification. The constraints on eq. (8) are
given by [29]:

$$0 < p_1 < 1, \quad 0 < p_2 < 1, \quad p_1 + p_2 = 1, \quad 0 \le e_1 \le p_1, \quad 0 \le e_2 \le p_2. \qquad (9)$$

In this work, we use e to denote the error probability, or error
variable, representing either the Bayesian error or the non-Bayesian
error. Both are calculated from the same formula:

$$e\ (P_e \text{ or } P_E) = e_1 + e_2. \qquad (10)$$

Definition 3: (Minimum and maximum error bounds in binary
classifications) Classifications suggest the minimum error bound:

$$(P_E)_{\min} = (P_e)_{\min} = 0, \qquad (11)$$

where the subscript min denotes the minimum value. The maximum error
bound for the Bayesian error in binary classifications is [25]:

$$(P_e)_{\max} = p_{\min} = \min\{p_1, p_2\}, \qquad (12)$$

where min denotes the minimum operation. For the non-Bayesian error,
the maximum error bound becomes

$$(P_E)_{\max} = 1. \qquad (13)$$

Remark 2: For a given set of joint distributions in the bound studies,
one may be unable to tell whether it results from the Bayesian rule or
not. For simplicity, we regard the set as one for the Bayesian errors
if the error rate e always satisfies $e \le p_{\min}$. Otherwise, it is
a set for the non-Bayesian errors.

In a binary classification, the conditional entropy, $H(T|Y)$, is
calculated from the joint distribution in (8):

$$\begin{aligned} H(T|Y) &= H(T) - MI(T,Y) \\ &= -p_1\log_2 p_1 - p_2\log_2 p_2 - e_1\log_2\frac{e_1}{(p_2+e_1-e_2)\,p_1} - e_2\log_2\frac{e_2}{(p_1-e_1+e_2)\,p_2} \\ &\quad - (p_1-e_1)\log_2\frac{p_1-e_1}{(p_1-e_1+e_2)\,p_1} - (p_2-e_2)\log_2\frac{p_2-e_2}{(p_2+e_1-e_2)\,p_2}, \end{aligned} \qquad (14)$$

where $H(T)$ is the binary entropy of the random variable T, and
$MI(T,Y)$ is the mutual information between variables T and Y.

Remark 3: When a joint distribution p(t, y) is given, its associated
conditional entropy $H(T|Y)$ is uniquely determined. However, a given
$H(T|Y)$ generally does not determine a unique p(t, y); multiple
solutions typically exist, as shown later in this work.
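
Eqs. (8) and (14) translate directly into a short computation. The sketch below is illustrative only (the function names are not from the original text) and uses the equivalent identity $H(T|Y) = H(T,Y) - H(Y)$ instead of expanding eq. (14) term by term. The two settings chosen have different priors and error components but the same error probability, and they yield the same conditional entropy, which is the non-uniqueness noted in Remark 3.

```python
import math

def joint_from_errors(p1, e1, e2):
    """Joint distribution of eq. (8) as ((p11, p12), (p21, p22))."""
    p2 = 1.0 - p1
    if not (0.0 <= e1 <= p1 and 0.0 <= e2 <= p2):
        raise ValueError("error components violate the constraints of eq. (9)")
    return ((p1 - e1, e1), (e2, p2 - e2))

def conditional_entropy(joint):
    """H(T|Y) in bits, computed as H(T,Y) - H(Y)."""
    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0.0)
    cells = [p for row in joint for p in row]
    col_sums = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    return entropy(cells) - entropy(col_sums)

setting_a = joint_from_errors(0.55, 0.10, 0.30)  # e = 0.4 with unequal priors
setting_b = joint_from_errors(0.50, 0.20, 0.20)  # e = 0.4 with equal priors
h_a = conditional_entropy(setting_a)
h_b = conditional_entropy(setting_b)
print(f"H(T|Y): {h_a:.4f} and {h_b:.4f}")
```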

Definition 4: (Admissible point, admissible set, and their properties
in a diagram of entropy and error probability) In a given diagram of
entropy and error probability, a point is defined to be an admissible
point if it can be realized by a non-empty set of joint distributions
for the given classification information. Otherwise, it is a
non-admissible point. All admissible points form an admissible set (or
admissible region(s)), which is enclosed by the bounds (also called
the boundary). If every point located on the boundary is admissible
(or non-admissible), we call the admissible set closed (or open). If
only part of the boundary points are admissible, the set is said to be
partially closed. If an admissible point under the given conditions is
realized by a unique joint distribution, it is called a one-to-one
mapping point. If more than one joint distribution is associated with
the same admissible point, it is called a one-to-many mapping point.

We consider that classifications provide an exemplary justification
for raising the first issue in Section I about the bound studies. The
main reason is that a single index of error probability may not be
sufficient for dealing with classification problems. For example, when
processing class-imbalance problems [30], [31], we need to distinguish
error types. In other words, for the same error probability e (or even
the same admissible point), we also need to know the error components
$e_1$ and $e_2$. Suppose one encounters a medical diagnosis problem,
where $p_1$ generally represents the majority class of healthy persons
(labeled negative, or -1, in Fig. 3), and $p_2$ the minority class of
abnormal persons (labeled positive, or 1). A class-imbalance problem
is then formed. While $e_1$ (also called the type I error) is
tolerable, $e_2$ (the type II error) seems intolerable because
abnormal persons are considered to be "healthy". Hence, from both
theoretical and application viewpoints, it is necessary to establish
relations between bounds and joint distributions, which provide
error-type information within the error probability for better
interpretation of the bounds.

IV. Upper and lower bounds for Bayesian errors 

In this work, we select the bound relations between entropy and error
probability. The bounds and their associated error components are
given by the following two theorems in the context of binary
classifications.

Theorem 1: (Lower bound and associated error components) The lower
bound in a diagram of "$P_e$ vs. $H(T|Y)$" and the associated error
components are given by:

$$P_e \ge \max\{0, G_1(H(T|Y))\}, \qquad (15a)$$

for

$$G_1^{-1}(P_e) = H(T|Y) = -P_e\log_2 P_e - (1-P_e)\log_2(1-P_e), \qquad P_e = e_1 + e_2 \le p_{\min}, \qquad (15b)$$

$$(e_1, e_2) = \begin{cases} (0.5, 0) \text{ or } (0, 0.5), & \text{if } P_e = 0.5, \\ \left(\dfrac{P_e(1-p_1-P_e)}{1-2P_e}, \dfrac{P_e(p_1-P_e)}{1-2P_e}\right), & \text{otherwise}, \end{cases} \qquad (15c)$$

where $H(T|Y)$ is the conditional entropy of T given Y, and $G_1$ is
called the lower bound function (or lower bound). However, a
closed-form solution can be achieved only for its inverse function,
$G_1^{-1}(\cdot)$, not for $G_1$ itself.
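
As a numerical check of Theorem 1, the sketch below evaluates the error components of eq. (15c) for one assumed pair of values ($p_1 = 0.55$, $P_e = 0.4$) and compares the resulting conditional entropy with the binary entropy of $P_e$ in eq. (15b). The function names are illustrative and not part of the original text.

```python
import math

def binary_entropy(p):
    """H(p) in bits, with the convention 0 * log2(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def fano_error_components(p1, pe):
    """Error components (e1, e2) of eq. (15c), for pe <= min(p1, 1 - p1)."""
    if pe == 0.5:
        return (0.5, 0.0)  # (0.0, 0.5) is the other choice listed in eq. (15c)
    scale = pe / (1.0 - 2.0 * pe)
    return (scale * (1.0 - p1 - pe), scale * (p1 - pe))

def conditional_entropy(p1, e1, e2):
    """H(T|Y) in bits for the joint distribution of eq. (8)."""
    p2 = 1.0 - p1
    cells = [p1 - e1, e1, e2, p2 - e2]
    columns = [p1 - e1 + e2, e1 + p2 - e2]

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0.0)

    return entropy(cells) - entropy(columns)

p1, pe = 0.55, 0.4
e1, e2 = fano_error_components(p1, pe)
gap = binary_entropy(pe) - conditional_entropy(p1, e1, e2)
print(f"e1 = {e1:.3f}, e2 = {e2:.3f}, gap to the bound = {gap:.2e}")
```

A gap at the level of floating-point rounding indicates that the point lies on the lower bound for these values.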

Proof: Based on eq. (14), the lower bound function is derived from the
following definition:

$$G_1^{-1}(e) = \arg\max_{e} H(T|Y), \quad \text{subject to eqs. (9) and (10)}, \qquad (16)$$

where e is taken as the input variable in the derivations. Eq. (16)
describes the maximum of $H(T|Y)$ as a function of e, which must
satisfy the general constraints on joint distributions in eq. (9).
$H(T|Y)$ appears to be governed by the four variables $p_i$ and $e_i$
in eq. (14). However, only two independent parameters determine the
solutions of (14) and (16). The reduction from four variables to two
is due to the two constraints imposed between the parameters, that is,
$p_1 + p_2 = 1$ and $e_1 + e_2 = e$. When $p_1$ and $e_1$ are set as
the two independent variables, eq. (16) is equivalent to solving the
following problem:

$$G_1^{-1}(p_1, e_1) = \arg\max H(T|Y), \quad \text{subject to eqs. (9) and (10)}. \qquad (17)$$

$G_1^{-1}(p_1, e_1)$ is a continuous and differentiable function with
respect to the two variables. A differential approach is applied
analytically to search for the critical points of the optimization in
eq. (17). We obtain the two partial derivatives below and set them to
zero:

$$\frac{\partial H(T|Y)}{\partial e_1} = \log_2\frac{(p_1-e_1)(P_e-e_1)(1+2e_1-p_1-P_e)^2}{e_1(1+e_1-p_1-P_e)(p_1+P_e-2e_1)^2} = 0,$$

$$\frac{\partial H(T|Y)}{\partial p_1} = \log_2\frac{(p_1-2e_1+P_e)(1+e_1-p_1-P_e)}{(p_1-e_1)(1+2e_1-p_1-P_e)} = 0. \qquad (18)$$

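By solving them simultaneously, we obtain three pairs of critical
points through analytical derivations:

$$e_1 = \frac{P_e(1-p_1-P_e)}{1-2P_e}, \qquad p_1 = \frac{P_e + 2e_1P_e - e_1 - P_e^2}{P_e}, \qquad (19a)$$

$$e_1 = \frac{p_1(p_1+P_e-1)}{2p_1-1}, \qquad p_1 = \frac{1}{2} + e_1 - \frac{P_e}{2} + \frac{1}{2}\sqrt{1 + P_e^2 + 4e_1^2 - 4e_1P_e - 2P_e}, \qquad (19b)$$

$$e_1 = \frac{p_1(p_1+P_e-1)}{2p_1-1}, \qquad p_1 = \frac{1}{2} + e_1 - \frac{P_e}{2} - \frac{1}{2}\sqrt{1 + P_e^2 + 4e_1^2 - 4e_1P_e - 2P_e}. \qquad (19c)$$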

The highest order of each variable, $e_1$ and $p_1$, in eq. (18) is
four. However, the factor $(1 + 2e_1 - p_1 - P_e)^2$ within the first
equation of (18) reduces the total solution order from four to three.
Therefore, the three pairs of critical points form a complete set of
possible solutions to the problem in eq. (17). The final solution
should be the pair(s) that both maximizes $H(T|Y)$ with respect to
$e_1$ for the given $e = P_e$ and satisfies the constraints. Because
of the highly nonlinear second-order partial derivatives of $H(T|Y)$,
it seems intractable to examine the three pairs analytically for the
final solution.

To overcome this difficulty, we apply a symbolic software tool,
Maple™ 9.5 (a registered trademark of Waterloo Maple, Inc.), to obtain
a semi-analytical solution to the problem (see the Maple code in
Appendix A). For simplicity and without loss of generality in
classifications, we treat $p_1$ and $P_e$ as known constants in the
function. The concavity of $H(T|Y)$ with respect to $e_1$ over the
ranges defined in eq. (9) is confirmed numerically by varying the
values of $p_1$ and $P_e$. A single maximum of $H(T|Y)$ is always
obtained, but it is described alternately by the two expressions for
$e_1$ in (19), depending on the conditions on $p_1$ and $P_e$. ■
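
The numerical step of the proof can be mimicked with a simple grid search. The sketch below is written in Python and is not the Maple code of Appendix A; the function names, the grid resolution, and the two test pairs are assumptions chosen for illustration. For fixed $p_1$ and total error, it searches the feasible range of $e_1$ for the maximum of $H(T|Y)$ and prints the first components of eqs. (19a) and (19b) for comparison. The second pair has a total error above $p_{\min}$, so it falls outside the Bayesian range and is included only to show the other candidate expression.

```python
import math

def conditional_entropy(p1, e1, pe):
    """H(T|Y) in bits for prior p1, total error pe, and first error component e1."""
    p2, e2 = 1.0 - p1, pe - e1
    cells = [p1 - e1, e1, e2, p2 - e2]
    columns = [p1 - e1 + e2, e1 + p2 - e2]

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0.0)

    return entropy(cells) - entropy(columns)

def maximize_over_e1(p1, pe, steps=4000):
    """Grid search for the e1 that maximizes H(T|Y) on its feasible range."""
    low = max(0.0, pe - (1.0 - p1))  # keeps e2 = pe - e1 <= p2
    high = min(p1, pe)               # keeps e1 <= p1 and e2 >= 0
    grid = [low + (high - low) * k / steps for k in range(steps + 1)]
    best = max(grid, key=lambda e1: conditional_entropy(p1, e1, pe))
    return best, conditional_entropy(p1, best, pe)

def closed_form_candidates(p1, pe):
    """First components of eqs. (19a) and (19b); requires pe != 0.5 and p1 != 0.5."""
    e1_a = pe * (1.0 - p1 - pe) / (1.0 - 2.0 * pe)
    e1_b = p1 * (p1 + pe - 1.0) / (2.0 * p1 - 1.0)
    return e1_a, e1_b

for p1, pe in [(0.55, 0.4), (0.25, 0.4)]:
    e1_grid, h_max = maximize_over_e1(p1, pe)
    e1_a, e1_b = closed_form_candidates(p1, pe)
    print(f"p1={p1}, e={pe}: grid e1={e1_grid:.4f}, "
          f"(19a)={e1_a:.4f}, (19b)={e1_b:.4f}, H={h_max:.4f}")
```

In each case only one of the two closed-form candidates lies inside the feasible range, and it is the one the grid search approaches.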

Remark 4: Theorem 1 yields the same lower bound found by Fano [1]
(Fig. 4), which is general for finite alphabets (or multiclass
classifications). One specific relation to Fano's bound is given by
the marginal probability (see eq. (2-144) in [2]):

$$p(y) = \left(1 - P_e, \frac{P_e}{m-1}, \ldots, \frac{P_e}{m-1}\right), \qquad (20)$$

which is termed sharp for attaining equality in eq. (1) [12]. We call
Fano's bound an exact lower bound because every point on it is sharp.
The sharp conditions in terms of error components in (15c) are a
special case of the study in [21] and can be derived directly from
their Theorem 1.

Theorem 2: (Upper bound and associated error components) The upper
bound and the associated error components are given by:

$$P_e \le \min\{p_{\min}, G_2(H(T|Y))\}, \qquad (21a)$$

for

$$G_2^{-1}(P_e) = H(T|Y) = -p_{\min}\log_2\frac{p_{\min}}{P_e+p_{\min}} - P_e\log_2\frac{P_e}{P_e+p_{\min}}, \qquad (21b)$$

and

$$P_e = e_i \le p_{\min}, \quad e_j = 0, \quad p_i \ge p_j, \quad i \ne j, \quad i, j = 1, 2, \qquad (21c)$$

where $G_2$ is called the upper bound function (or upper bound).
Again, a closed-form solution can be achieved only for its inverse
function, $G_2^{-1}(\cdot)$.

Proof: The upper bound function is obtained by solving the following
problem:

$$G_2^{-1}(p_1, e_1) = \arg\min_{e = P_e} H(T|Y), \quad \text{subject to eqs. (9) and (10)}. \qquad (22)$$

Because $H(T|Y)$ is concave with respect to $e_1$ under the
constraints defined in eq. (9), the possible solutions for $e_1$ are
located at the two end points, 0 and $P_e$, of its feasible range. We
take the end point that produces the smaller $H(T|Y)$ as the final
solution. The solution from the Maple code in Appendix B confirms the
closed-form expressions in eq. (21). ■
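
The end-point argument of the proof can be checked numerically. The sketch below is illustrative (it is not the Maple code of Appendix B, and the function names and test values are assumptions): it evaluates $H(T|Y)$ at the two end points $e_1 = 0$ and $e_1 = P_e$, takes the smaller value, and compares it with the closed form on the right-hand side of eq. (21b).

```python
import math

def conditional_entropy(p1, e1, e2):
    """H(T|Y) in bits for the joint distribution of eq. (8)."""
    p2 = 1.0 - p1
    cells = [p1 - e1, e1, e2, p2 - e2]
    columns = [p1 - e1 + e2, e1 + p2 - e2]

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0.0)

    return entropy(cells) - entropy(columns)

def end_point_minimum(p1, pe):
    """Smaller H(T|Y) of the end points e1 = 0 and e1 = pe (needs pe <= min(p1, 1 - p1))."""
    return min(conditional_entropy(p1, 0.0, pe), conditional_entropy(p1, pe, 0.0))

def closed_form(p1, pe):
    """Right-hand side of eq. (21b); requires pe > 0."""
    p_min = min(p1, 1.0 - p1)
    total = pe + p_min
    return -p_min * math.log2(p_min / total) - pe * math.log2(pe / total)

for p1, pe in [(0.6, 0.4), (0.8, 0.1), (0.3, 0.2)]:
    h_end = end_point_minimum(p1, pe)
    h_closed = closed_form(p1, pe)
    print(f"p1={p1}, Pe={pe}: end points {h_end:.6f}, eq. (21b) {h_closed:.6f}, "
          f"Kovalevskij H/2 = {h_closed / 2:.4f}")
```

In each printed case the error probability does not exceed H/2, so the points on the curved bound lie on or below Kovalevskij's line.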

Remark 5: Theorem 2 describes a novel set of upper bounds that is in
general tighter than Kovalevskij's bound [8] for binary
classifications (Fig. 4). For example, when $p_{\min} = 0.2$ is given,
the upper bound defined in eq. (21) consists of a curve "O-C" plus a
line "C-C'". Kovalevskij's upper bound, given by a line "O-C-A", is
sharp only at Point O and Point C. The solution in eq. (21c) confirms
an advantage of using the proposed optimization approach in
derivations: a closed-form expression of the exact bound can be
achieved.

In comparison, Kovalevskij's upper bound described in eq. (3) is
general for multiclass classifications. However, it lacks a general
relation to error components like eq. (21c), although that relation is
restricted to a binary state. To distinguish it from Kovalevskij's
upper bound, we also call $G_2$ a curved upper bound. The new linear
upper bound, $(P_e)_{\max} = p_{\min}$, shows the maximum error for the
Bayesian decisions in binary classifications [25], which is also
equivalent to the solution of a blind guess when using the
maximum-likelihood decision [29]. If $p_1 = p_2$, the upper bound
becomes a single curved one.

Remark 6: The lower and upper bounds defined by eqs. (15) and (21)
form a closed admissible region in the diagram of "$P_e$ vs.
$H(T|Y)$". The shape of the admissible region changes depending on the
single parameter $p_{\min}$.

V. Upper and lower bounds for non-Bayesian errors

In classification problems, the Bayesian errors can be realized only
if one has exact information about all probability distributions of
the classes. This assumption generally cannot be satisfied in real
applications. In addition, various classifiers are designed with
non-Bayesian rules, such as conventional decision trees, artificial
neural networks, and support vector machines [29]. Therefore, the
analysis of non-Bayesian errors is of significant interest in
classification studies.

Fig. 4. Plot of bounds in a "$P_e$ vs. $H(T|Y)$" diagram (horizontal
axis: $H(T|Y)$; vertical axis: $P_e$; curves: Fano's lower bound,
Kovalevskij's upper bound, and the curved upper bounds for different
values of $p_{\min}$).

Definition 5: (Label-switching in binary classifications) In binary
classifications, a label-switching operation is an exchange between
the two labels. Suppose the original joint distribution is denoted by:

$$p_A(t,y): \quad p_{11} = a, \quad p_{12} = b, \quad p_{21} = c, \quad p_{22} = d. \qquad (23a)$$

A label-switching operation changes the prediction labels in Fig. 3 to
$y_1 = 1$ and $y_2 = -1$, and generates the following joint
distribution:

$$p_B(t,y): \quad p_{11} = b, \quad p_{12} = a, \quad p_{21} = d, \quad p_{22} = c. \qquad (23b)$$


Proposition 1: (Invariance under label-switching) The related entropy
measures, including $H(T)$, $H(Y)$, $MI(T,Y)$, and $H(T|Y)$, are
invariant to the labels, that is, unchanged by a label-switching
operation in binary classifications. However, the error e changes to
1 - e.

Proof: Substituting the two joint distributions in eq. (23) into each
entropy formula gives the same results. The error changes because the
error b + c in eq. (23a) becomes a + d = 1 - (b + c) in eq. (23b). ■
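
Proposition 1 can be checked on a concrete joint distribution. The sketch below is illustrative; the function names and the numerical joint distribution are assumptions, not taken from eq. (23). Label-switching is implemented as a swap of the two prediction columns.

```python
import math

def switch_labels(joint):
    """Swap the two prediction labels, i.e. the two columns of the joint distribution."""
    return tuple((row[1], row[0]) for row in joint)

def error_probability(joint):
    """Sum of the off-diagonal cells, p12 + p21."""
    return joint[0][1] + joint[1][0]

def conditional_entropy(joint):
    """H(T|Y) in bits, computed as H(T,Y) - H(Y)."""
    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0.0)
    cells = [p for row in joint for p in row]
    col_sums = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    return entropy(cells) - entropy(col_sums)

original = ((0.45, 0.10), (0.30, 0.15))  # hypothetical values with a + b + c + d = 1
switched = switch_labels(original)
print(f"e: {error_probability(original):.2f} -> {error_probability(switched):.2f}")
print(f"H(T|Y): {conditional_entropy(original):.6f} -> {conditional_entropy(switched):.6f}")
```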

Theorem 3: (Lower bound and upper bound for non-Bayesian error without
information of $p_1$ and $p_2$) In the context of binary
classifications, when information about $p_1$ and $p_2$ is unknown
(say, before classification), the lower bound and upper bound for the
non-Bayesian error are given by:

$$G_1(H(T|Y)) \le P_E \le 1 - G_1(H(T|Y)), \qquad (24a)$$

for

$$G_1^{-1}(P_E) = H(T|Y) = -P_E\log_2 P_E - (1-P_E)\log_2(1-P_E), \qquad P_E = e_1 + e_2 \le 1, \qquad (24b)$$

$$(e_1, e_2) = \begin{cases} (0.5, 0) \text{ or } (0, 0.5), & \text{if } p_1 = p_2 = P_E = 0.5, \\ \left(\dfrac{P_E(1-p_1-P_E)}{1-2P_E}, \dfrac{P_E(p_1-P_E)}{1-2P_E}\right), & \text{if } (1-p_1-P_E)(p_1-P_E)(P_E-0.5) > 0, \\ \left(\dfrac{p_1(p_1+P_E-1)}{2p_1-1}, \dfrac{(1-p_1)(p_1-P_E)}{2p_1-1}\right), & \text{otherwise}, \end{cases} \qquad (24c)$$

where we call the upper bound in eq. (24a), $1 - G_1(H(T|Y))$, the
general upper bound (or mirrored lower bound); it is the mirror of
Fano's lower bound about the axis $P_E = 0.5$. Both bounds share the
same expressions in eq. (24c) for calculating the associated error
components. When $P_E < 0.5$, the components $e_1$ and $e_2$
correspond to the lower bound; otherwise, they correspond to the upper
bound.

Proof: Suppose an admissible point is located on the lower bound with
$P_E < 0.5$. By a label-switching operation, one obtains the mirrored
admissible point at $1 - P_E > 0.5$, which is located on the mirrored
lower bound. Proposition 1 implies that both points share the same
value of $H(T|Y)$. Because $P_E$ is the smallest error for the given
conditional entropy $H(T|Y)$, its mirrored point is the largest one,
which creates the general upper bound. ■

Remark 7: Fano's lower bound, its mirrored bound, and the axis of
$P_E$ form an admissible region for the non-Bayesian error when
information about $p_1$ and $p_2$ is unknown, denoted by the boundary
"O-E'-A-E-D-O" in Fig. 5. On the axis of $P_E$, only Points O and D
are admissible. Hence, the admissible region is partially closed.

Theorem 4: (Admissible region(s) for non-Bayesian error with known
information of $p_1$ and $p_2$) In binary classifications, when
information about $p_1$ and $p_2$ is known, a closed admissible region
for the non-Bayesian error is generally formed (Fig. 5) by Fano's
lower bound, the general upper bound, the curved upper bound
$G_2^{-1}(\cdot)$, the mirrored upper bound of $G_2^{-1}(\cdot)$, and
the upper bound $H(T|Y)_{\max}$. For the $H(T|Y)_{\max}$ bound, the
associated error components are given by:

$$\text{for } H(T|Y) = H(T|Y)_{\max} = H(e = p_{\min}),$$

$$(e_1, e_2) = \begin{cases} (0.25, 0.25), & \text{if } p_1 = p_2 = P_E = 0.5, \\ \left(\dfrac{p_1(p_1+P_E-1)}{2p_1-1}, \dfrac{p_2(p_2+P_E-1)}{2p_2-1}\right), & \text{otherwise}. \end{cases} \qquad (25)$$

Proof: Following the proof of Theorem 3, one obtains the mirrored
upper bound of $G_2^{-1}(\cdot)$. The upper bound $H(T|Y)_{\max}$ is
calculated from the condition $H(T|Y) \le H(T)$ [2]. For given $p_1$
and $p_2$, $H(T|Y)_{\max}$ is a constant. Because $H(T|Y)_{\max}$ also
implies a minimization of $MI(T,Y)$ in eq. (14), its associated error
components can be obtained from the minimization relation of
$MI(T,Y)$ in the form (see eq. (35) in [33]):

$$\frac{p_{11}}{p_{21}} = \frac{p_{12}}{p_{22}}. \qquad (26)$$

Fig. 5. Plot of bounds in a "$P_E$ vs. $H(T|Y)$" diagram.

Remark 8: Eqs. (25) and (26) equivalently imply a zero value of the
mutual information, $MI(T,Y) = 0$, which suggests no correlation [29],
or statistical independence [2], between the two variables T and Y.

Remark 9: When information about $p_1$ and $p_2$ is known, the shape
of the admissible region(s) depends fully on the single parameter
$p_{\min}$. Two closed admissible regions are formed only when
$p_1 = p_2$ (Fig. 5). One region is formed by Fano's lower bound and
the upper bound. The other is formed by the mirrored upper bound and
the general upper bound. In general, the non-Bayesian error $P_E$ can
be higher than Kovalevskij's bound.

VI. Classification Interpretations to some key points

To better understand the theoretical results against a background of
classifications, interpretations are given for some key points shown
in Figs. 4 and 5. These key points may hold special features in
classifications.

Point O: This point represents a zero value of $H(T|Y)$. It also
suggests a perfect classification without any error
($P_e = P_E = 0$), obtained by a specific setting of the joint
distribution:

$$p_{11} = p_1, \quad p_{12} = 0, \quad p_{21} = 0, \quad p_{22} = p_2. \qquad (27)$$

This point is always admissible and independent of error types.

Point A: This point shows the maximum value, $H(T|Y) = 1$, for
class-balanced classifications ($p_1 = p_2$). Three specific
classification settings can be obtained for representing this point.
The two settings from eq. (24c) amount to no classification:

$$p_{11} = 1/2, \quad p_{12} = 0, \quad p_{21} = 1/2, \quad p_{22} = 0, \qquad \text{or} \qquad p_{11} = 0, \quad p_{12} = 1/2, \quad p_{21} = 0, \quad p_{22} = 1/2. \qquad (28)$$

They also indicate zero information [32] from the classification
decisions. The other setting is random guessing from eq. (25):

$$p_{11} = 1/4, \quad p_{12} = 1/4, \quad p_{21} = 1/4, \quad p_{22} = 1/4. \qquad (29)$$

For the Bayesian errors, this point is always included by both Fano's
bound and Kovalevskij's bound. However, according to the upper bounds
defined in (21a), this point is non-admissible whenever $p_1 = p_2$
does not hold. For the non-Bayesian errors, the point is either
admissible or non-admissible depending on the given information about
$p_1$ and $p_2$. This example suggests that the admissibility of a
point generally depends on the given information in classifications.
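
The three settings listed for Point A can be verified directly. The sketch below is illustrative, with assumed function names and dictionary labels: it computes the error probability, the conditional entropy, and the mutual information for each setting of eqs. (28) and (29).

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability cells."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def summarize(joint):
    """Return (error probability, H(T|Y), MI(T,Y)) for a 2x2 joint distribution."""
    cells = [p for row in joint for p in row]
    row_sums = [sum(row) for row in joint]
    col_sums = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    error = joint[0][1] + joint[1][0]
    h_cond = entropy(cells) - entropy(col_sums)
    mi = entropy(row_sums) + entropy(col_sums) - entropy(cells)
    return error, h_cond, mi

settings = {
    "all samples predicted as y1": ((0.5, 0.0), (0.5, 0.0)),
    "all samples predicted as y2": ((0.0, 0.5), (0.0, 0.5)),
    "random guessing": ((0.25, 0.25), (0.25, 0.25)),
}
for name, joint in settings.items():
    error, h_cond, mi = summarize(joint)
    print(f"{name}: e = {error:.2f}, H(T|Y) = {h_cond:.2f}, MI = {mi:.2f}")
```

All three settings map to the same point of the diagram while carrying different error components, which is an instance of the one-to-many mapping of Definition 4.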

Point D: This point occurs for the non-Bayesian classifica- 
tions in a form of: 

Pll =0, Pl2 = Pl , (30) 

P21 = P2, P22 = 0. 

In this case, one can exchange the labels for a perfect classi- 
fication. 

Point B: This point is located at the corner formed by the
curved and linear upper bounds, with $H(T|Y) = 0.8$ and
$e = 0.4$. Apart from Point O, this is another point obtained
from eq. (21) that lies on Kovalevskij's upper bound. The
point can be realized by either Bayesian or non-Bayesian
classifications. Suppose $p_1 > p_2 = 0.4$ for the Bayesian
classifications. One will achieve Point B by the classification

$$p_{11} = 0.2,\ p_{12} = 0.4,\ p_{21} = 0,\ p_{22} = 0.4, \tag{31}$$

for a one-to-one mapping. In other words, the point becomes
non-admissible whenever $p_{\min} \neq 0.4$. If the non-Bayesian
errors are considered, this point will possess a one-to-many
mapping. For example, one can get another setting by first
solving $H(p_{\min}) = 0.8$ for $p_{\min}$. Then, by substituting
the relations $p_2 = p_{\min}$ and $P_e = 0.4$ into eq. (25), one
can get the error components. The numerical results give the
approximate solutions $p_{\min} \approx 0.2430$, $e_1 \approx 0.2312$,
and $e_2 \approx 0.1688$ for another setting of Point B.
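The second setting of Point B can be reproduced with the following Python sketch. It rests on one assumption: that eq. (25) amounts to zero mutual information, as Remark 8 states, so that the joint distribution factorizes as $p(t_i, y_j) = p_i q_j$. The bisection routine and all function names are hypothetical and are not taken from the original work.

```python
from math import log2

def binary_entropy(p):
    """H(p) in bits for a binary distribution (p, 1 - p)."""
    return -sum(x * log2(x) for x in (p, 1.0 - p) if x > 0)

def solve_p_min(target, tol=1e-12):
    """Bisection for H(p) = target on (0, 1/2), where H is increasing."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

pe = 0.4
p2 = solve_p_min(0.8)            # p2 = p_min
p1 = 1.0 - p2
# Assumption: zero mutual information, so p(t_i, y_j) = p_i * q_j.
# Then Pe = p1*q2 + p2*(1 - q2), which is solved for q2 = p(y2).
q2 = (pe - p2) / (p1 - p2)
e1 = p1 * q2                     # e1 = p12
e2 = p2 * (1.0 - q2)             # e2 = p21

print(f"p_min = {p2:.4f}, e1 = {e1:.4f}, e2 = {e2:.4f}")
```

Because the mutual information is zero, $H(T|Y) = H(T) = H(p_{\min}) = 0.8$, and the two error components sum to 0.4, so this setting lands on the same point as (31).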

Point B': A point located at the lower bound, like
Point B', will produce a one-to-many mapping for either the
Bayesian errors or the non-Bayesian errors. One specific setting
in terms of the Bayesian errors is:

$$p_{11} = 0.6,\ p_{12} = 0,\ p_{21} = 0.4,\ p_{22} = 0, \tag{32}$$

which suggests zero information from classifications. More
settings can be obtained from eq. (15). For example, given
$p_1 = 0.55$, $p_2 = 0.45$ and $P_e = 0.4$, one can have:

$$p_{11} = 0.45,\ p_{12} = 0.1,\ p_{21} = 0.3,\ p_{22} = 0.15. \tag{33}$$

The non-Bayesian errors will enlarge the set of one-to-many
mappings for an admissible point of the Bayesian errors, owing to
the relaxed condition of (13). One setting is for balanced
error components:

$$p_{11} = 0.3,\ p_{12} = 0.2,\ p_{21} = 0.2,\ p_{22} = 0.3. \tag{34}$$
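The one-to-many mapping at Point B' can be made concrete with a small Python sketch, given here only as an illustration. It evaluates the three joint distributions listed above; the helper name `cond_entropy` is hypothetical.

```python
from math import log2

def cond_entropy(p11, p12, p21, p22):
    """H(T|Y) in bits; each column (p1j, p2j) belongs to one decision y_j."""
    h = 0.0
    for column in ((p11, p21), (p12, p22)):
        q = sum(column)                                   # marginal p(y_j)
        h -= sum(p * log2(p / q) for p in column if p > 0)
    return h

settings = [
    (0.60, 0.00, 0.40, 0.00),   # all samples assigned to class 1
    (0.45, 0.10, 0.30, 0.15),   # p1 = 0.55, p2 = 0.45
    (0.30, 0.20, 0.20, 0.30),   # balanced error components
]
for p11, p12, p21, p22 in settings:
    h = cond_entropy(p11, p12, p21, p22)
    print(f"H(T|Y) = {h:.4f}, Pe = {p12 + p21:.2f}")
```

All three settings share the error probability 0.4 and the same conditional entropy, which is the binary entropy of 0.4, although their joint distributions differ.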



Eq. (24c) will be applicable for deriving a specific setting
when $p_1$ and $P_e$ are given. For example, two settings can be
obtained:

$$\text{if } p_1 = 0.25,\ P_e = 0.4, \text{ then } e_1 = 0.175,\ e_2 = 0.225, \tag{35}$$

$$\text{if } p_1 = 0.3,\ P_e = 0.4, \text{ then } e_1 = 0.225,\ e_2 = 0.175, \tag{36}$$

for representing the same point, Point B', which is located at
$H(T|Y) \approx 0.9710$ and $P_e = 0.4$ in the diagram (Fig. 4).

Points E and E': All points located at the general upper
bound, like Point E, will correspond to settings from the
non-Bayesian errors. A point located at the lower bound,
say E', can represent settings from either the Bayesian
or non-Bayesian errors, depending on the given information
in classifications. Points E and E' form mirrored points.
Their settings can be, but need not be, connected by the
relation in (23). For example, one specific setting for Point E'
with $p_1 = 0.3$ and $p_2 = 0.7$ is:

$$p_{11} = 0,\ p_{12} = 0.3,\ p_{21} = 0,\ p_{22} = 0.7, \tag{37}$$

the other for Point E with $p_1 = 0.8$ and $p_2 = 0.2$ is:

$$p_{11} = 20/30,\ p_{12} = \ldots/30,\ p_{21} = \ldots/30,\ p_{22} = \ldots/30. \tag{38}$$



They are mirrored to each other but have no label-switching 
relation. 

Points A' and A'': When $P_e = 0.5$ and $p_{\min} = 0.1$, Points
A' and A'' form a pair as the ending points for the given
conditions. Supposing $p_1 = 0.9$ and $p_2 = 0.1$, one can get the
specific setting for Point A' from eq. (21c):

$$p_{11} = 0.4,\ p_{12} = 0.5,\ p_{21} = 0,\ p_{22} = 0.1, \tag{39}$$

and one for Point A'' from eq. (25):

$$p_{11} = 0.45,\ p_{12} = 0.45,\ p_{21} = 0.05,\ p_{22} = 0.05. \tag{40}$$
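The two ending points can be located in the diagram with the following Python sketch, offered as an illustration only. The joint distributions are built from $p_1 = 0.9$ and the error components $e_1 = p_{12}$, $e_2 = p_{21}$, so that each row sums to its prior; the helper names are hypothetical.

```python
from math import log2

def joint_from_errors(p1, e1, e2):
    """Joint (p11, p12, p21, p22) with p12 = e1 and p21 = e2."""
    return p1 - e1, e1, e2, (1.0 - p1) - e2

def cond_entropy(p11, p12, p21, p22):
    """H(T|Y) in bits; each column (p1j, p2j) belongs to one decision y_j."""
    h = 0.0
    for column in ((p11, p21), (p12, p22)):
        q = sum(column)                                   # marginal p(y_j)
        h -= sum(p * log2(p / q) for p in column if p > 0)
    return h

p1 = 0.9
point_a1 = joint_from_errors(p1, e1=0.5, e2=0.0)      # Point A'
point_a2 = joint_from_errors(p1, e1=0.45, e2=0.05)    # Point A''

for name, joint in (("A'", point_a1), ("A''", point_a2)):
    pe = joint[1] + joint[2]
    print(f"Point {name}: H(T|Y) = {cond_entropy(*joint):.4f}, Pe = {pe:.2f}")
```

Both settings have an error probability of 0.5, while their conditional entropies come out at roughly 0.39 and 0.47 bits, which marks the two ends of the admissible segment for these priors.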






Points Q and R: The two points are specific due to their 
positions in the diagrams. For either type of errors, both points 
are non-admissible in the diagrams, because no setting exists 
in binary classifications which can represent the points. 

VII. Summary and discussions 

This work investigates the upper and lower bounds between
entropy and error probability. An optimization approach is
proposed for deriving the bound functions from a joint
distribution. As a preliminary work, we consider binary
classifications as a case study. Through this approach, a
new upper bound is derived and shown to be generally tighter
than Kovalevskij's upper bound. The closed-form relations
between the bounds and the error components are presented. The
analytical results lead to a better understanding of the sharp
conditions of the bounds in terms of error components. Because
classifications involve either Bayesian or non-Bayesian errors,
we demonstrate the bounds comparatively for both types
of errors.

We recognize that analytical tractability is an issue for the
proposed approach. Fortunately, symbolic software tools are
helpful for solving complex problems by different
semi-analytical means (such as in [34], [35]). The
semi-analytical solution used in this work refers to the
analytical derivation of possible solutions, followed by
numerical verification of the final solution.
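As one example of such a numerical check, the comparison with Kovalevskij's upper bound can be sketched in Python. Two assumptions are made here and should be read as such: the curved upper bound for $e \le p_{\min}$ is taken to be the closed form HTY_1 printed in Appendix B with $p_2 = p_{\min}$, and Kovalevskij's bound in the binary case is taken as $P_e \le H(T|Y)/2$, which is consistent with Point B at $H(T|Y) = 0.8$ and $e = 0.4$. The value $p_{\min} = 0.2$ is hypothetical.

```python
from math import log2

def upper_bound_entropy(e, p_min):
    """H(T|Y) in bits on the curved upper bound, for 0 < e <= p_min.

    Assumption: this is HTY_1 from Appendix B with p2 = p_min.
    """
    s = e + p_min
    return p_min * log2(s / p_min) + e * log2(s / e)

p_min = 0.2                                    # hypothetical minimum prior
errors = [p_min * k / 10 for k in range(1, 11)]
for e in errors:
    h = upper_bound_entropy(e, p_min)
    kovalevskij_error = h / 2.0                # assumed binary form of the bound
    print(f"H(T|Y) = {h:.4f}: new bound e = {e:.3f}, "
          f"Kovalevskij e = {kovalevskij_error:.3f}")
```

On this grid the error allowed by the curved bound never exceeds the Kovalevskij value at the same conditional entropy, and the two coincide at $e = p_{\min}$.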

To emphasize the importance of the study, we present
discussions below from the perspective of machine learning in
big-data classifications. We consider that binary classifications
will be one of the key techniques for implementing a
divide-and-conquer strategy to process large quantities of
data efficiently. Class-imbalance problems with extremely skewed
ratios mostly arise from a one-against-other division scheme
for binary classes. Researchers are, of course, concerned with
error types in classification performance. Knowledge of the
bounds in relation to error components is desirable for both
theoretical and application purposes.

From a viewpoint of machine learning, the bounds derived
in this work provide a basic solution for linking the learning
targets of error and entropy in the related studies. Error-based
learning is more conventional because of its compatibility
with our intuitions in daily life, such as "trial and error".
Significant studies have been reported under this category. In
comparison, information-based learning [36] is relatively new
and uncommon in some applications, such as classifications.
Entropy is not a well-accepted concept related to our intuition
in decision making. This is one of the reasons why the
learning target is chosen mainly based on error, rather than
on entropy. However, we consider that error is an empirical
concept, whereas entropy is theoretical and general. In [37],
we demonstrated that entropy can deal with both notions
of error and reject in abstaining classifications.
Information-based learning [36] presents a promising and wider
perspective for exploring and interpreting learning mechanisms.

When considering all sides of the issues stemming from
machine learning studies, we believe that "what to learn" is a
primary problem. However, it seems that more investigation is
focused on the issue of "how to learn", which should be treated
as a second-level problem. Moreover, in comparison with the
long-standing yet hot theme of feature selection, little study
has been done from the perspective of learning target
selection. We propose that this theme should be emphasized in the
study of machine learning. Hence, the relations studied in this
work are fundamental and crucial to the extent that researchers,
using either error-based or entropy-based approaches, are able
to reach a better understanding of the counterpart approach.
Appendix A
Maple code for deriving the lower bound

> restart;  # Clean the memory
> p2:=1-p1; e2:=Pe-e1;  # Describe the bound with respect to p1 and e1
> HT:=-p1*log[2](p1)-p2*log[2](p2);  # Shannon entropy
> p11:=(p1-e1); p12:=e1; p22:=p2-e2; p21:=e2;  # Terms of joint probability
> q1:=p11+p21; q2:=p12+p22;  # Intermediate variables
> MI:=p11*log[2](p11/q1/p1)+p12*log[2](p12/q2/p1);
> MI:=MI+p22*log[2](p22/q2/(1-p1))+p21*log[2](p21/q1/(1-p1));  # Mutual information
> HTY:=(HT-MI);  # Conditional entropy
> HTY_dif_p1:=simplify(combine(diff(HTY,p1),ln,symbolic));  # Differential w.r.t. p1

    HTY_dif_p1 := ln((p1 - 2 e1 + Pe) (-1 + p1 + Pe - e1) / ((p1 - e1) (-2 e1 - 1 + p1 + Pe))) / ln(2)

> HTY_dif_e1:=simplify(combine(diff(HTY,e1),ln,symbolic));  # Differential w.r.t. e1

    HTY_dif_e1 := ln((p1 - e1) (-2 e1 - 1 + p1 + Pe)^2 (Pe - e1) / ((p1 - 2 e1 + Pe)^2 e1 (-1 + p1 + Pe - e1))) / ln(2)

> solve({HTY_dif_p1=0, HTY_dif_e1=0}, {e1, p1});  # not a complete set of possible solutions

    {e1 = e1, p1 = -(Pe^2 + e1 - Pe - 2 e1 Pe) / Pe}

> E1:=solve(HTY_dif_e1, e1);  # a complete set of possible solutions when p1 is known

    E1 := Pe (-1 + p1 + Pe) / (2 Pe - 1), p1 (-1 + p1 + Pe) / (2 p1 - 1)

> P1_a:=solve(E1[1]=e1, {p1}); P1_bc:=solve(E1[2]=e1, {p1});  # a complete set of possible solutions when e1 is known

    P1_a := {p1 = -(Pe^2 + e1 - Pe - 2 e1 Pe) / Pe}

    P1_bc := {p1 = e1 + 1/2 - 1/2 Pe + 1/2 (4 e1^2 - 4 e1 Pe + 1 - 2 Pe + Pe^2)^(1/2)},
             {p1 = e1 + 1/2 - 1/2 Pe - 1/2 (4 e1^2 - 4 e1 Pe + 1 - 2 Pe + Pe^2)^(1/2)}

> simplify(combine(simplify(eval(HTY, e1=E1[1])), ln, symbolic));  # failed to show it explicitly
> simplify(eval(HTY, e1=E1[2]));  # Display of the lower bound function in terms of p1

    -(p1 ln(p1) + ln(1 - p1) - ln(1 - p1) p1) / ln(2)

> # verification of concavity of HTY by a numerical way (changing Pe and p1 arbitrarily
> Pe:=0.5; p1:=0.6; plot(HTY, e1=0..Pe);  # with the constraints)
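The two stationary-point candidates for e1 printed in the listing above, Pe (-1 + p1 + Pe) / (2 Pe - 1) and p1 (-1 + p1 + Pe) / (2 p1 - 1), can be cross-checked outside Maple. The following Python sketch is an independent numerical check under the same parameterization (p12 = e1, p21 = Pe - e1); it is not the authors' code, and the function names are hypothetical. The priors 0.55 and 0.30 are chosen so that each candidate yields admissible probabilities.

```python
from math import log2

def binary_entropy(p):
    """H(p) in bits for a binary distribution (p, 1 - p)."""
    return -sum(x * log2(x) for x in (p, 1.0 - p) if x > 0)

def cond_entropy(p1, pe, e1):
    """H(T|Y) in bits with p12 = e1 and p21 = Pe - e1."""
    e2 = pe - e1
    columns = ((p1 - e1, e2), (e1, 1.0 - p1 - e2))
    h = 0.0
    for column in columns:
        q = sum(column)                                   # marginal p(y_j)
        h -= sum(p * log2(p / q) for p in column if p > 0)
    return h

def first_candidate(p1, pe):
    return pe * (p1 + pe - 1.0) / (2.0 * pe - 1.0)

def second_candidate(p1, pe):
    return p1 * (p1 + pe - 1.0) / (2.0 * p1 - 1.0)

pe = 0.4
e1_a = first_candidate(0.55, pe)     # same e1 as in setting (33)
e1_b = second_candidate(0.30, pe)    # same e1 as in setting (36)

print(cond_entropy(0.55, pe, e1_a), binary_entropy(pe))
print(cond_entropy(0.30, pe, e1_b), binary_entropy(0.30))
```

For these inputs the first candidate gives a conditional entropy equal to the binary entropy of Pe, and the second gives the binary entropy of p1, which agrees with the closed form returned by the last `simplify` call in the listing.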
Appendix B
Maple code for deriving the upper bound

> restart;  # Clean the memory
> HT:=-p1*log[2](p1)-p2*log[2](p2);  # Shannon entropy
> p11:=(p1-e1); p12:=e1; p22:=p2-e2; p21:=e2;  # Terms of joint distribution
> # To examine the HTY on two ending points for e2, i.e., e2=0 and e2=e
> # For derivation of the upper bound function when e2=0
> e1:=e; e2:=0; p1:=1-p2;
> q1:=p11+p21; q2:=p12+p22;  # Intermediate variables
> MI:=p11*log[2](p11/q1/p1)+p12*log[2](p12/q2/p1);  # Mutual information
> MI:=MI+p22*log[2](p22/q2/(1-p1));  # Neglect one term when 0*log(0)=0
> HTY_1:=combine(simplify(combine(simplify(HT-MI),ln,symbolic)));
> # Display of the upper bound function when e2=0

    HTY_1 := (p2 ln((e + p2) / p2) + e ln((e + p2) / e)) / ln(2)

> # For derivation of the upper bound function when e2=e
> e1:=0; e2:=e;
> q1:=p11+p21; q2:=p12+p22;  # Intermediate variables
> MI:=p11*log[2](p11/q1/p1);  # Neglect one term when 0*log(0)=0
> MI:=MI+p22*log[2](p22/q2/(1-p1))+p21*log[2](p21/q1/(1-p1));
> HTY:=eval(HT-MI, p2=1-P1);  # Using P1 for p1
> HTY_2:=combine(simplify(combine(simplify(HTY),ln,symbolic)));
> # Display of the upper bound function in terms of e and P1

    HTY_2 := (-P1 ln(P1 / (P1 + e)) - e ln(e / (P1 + e))) / ln(2)

> # To calculate the difference between HTY_1 and HTY_2
> delta_HTY:=combine(simplify(HTY_1-HTY_2),ln,symbolic);

    delta_HTY := (p2 ln((e + p2) / p2) + e ln((e + p2) / (P1 + e)) + P1 ln(P1 / (P1 + e))) / ln(2)

> # numerical verification of the solution to HTY below:
> # changing p2 arbitrarily with the constraint
> # when p2<0.5, delta_HTY<0, HTY_1 is the final solution,
> # when p2>0.5, delta_HTY>0, HTY_2 is the final solution,
> # when p2=0.5, delta_HTY=0, both are the solutions.
> p2:=0.4; P1:=1-p2; plot(delta_HTY, e=0..p2);
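The sign pattern stated in the comments above can also be checked without plotting. The following Python sketch is an independent re-evaluation of the printed closed forms HTY_1 and HTY_2, with P1 = 1 - p2; the function names and the grid of e values are assumptions made for this illustration.

```python
from math import log2

def hty_1(e, p2):
    """Closed form HTY_1 (case e1 = e, e2 = 0), in bits."""
    return p2 * log2((e + p2) / p2) + e * log2((e + p2) / e)

def hty_2(e, p1):
    """Closed form HTY_2 (case e1 = 0, e2 = e), in bits."""
    return -p1 * log2(p1 / (p1 + e)) - e * log2(e / (p1 + e))

def delta_hty(e, p2):
    """Difference HTY_1 - HTY_2 with P1 = 1 - p2."""
    return hty_1(e, p2) - hty_2(e, 1.0 - p2)

e_grid = [0.05 * k for k in range(1, 8)]        # e = 0.05, ..., 0.35
for p2 in (0.4, 0.5, 0.6):
    values = [delta_hty(e, p2) for e in e_grid]
    print(f"p2 = {p2}: min = {min(values):.4f}, max = {max(values):.4f}")
```

On this grid the difference is negative for p2 = 0.4, positive for p2 = 0.6, and zero up to rounding for p2 = 0.5, in line with the three cases listed in the Maple comments.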



